[% setvar title TITLE %]
<div id="archive-notice">
    <h3>This file is part of the Perl 6 Archive</h3>
    <p>To see what is currently happening visit <a href="http://www.perl6.org/">http://www.perl6.org/</a></p>
</div>
<div class='pod'>
<a name='TITLE'></a><h1>TITLE</h1>
<p>Generalised Additions to Regexes</p>
<a name='VERSION'></a><h1>VERSION</h1>
<pre>  Maintainer: Richard Proctor &lt;<a href='mailto:richard@waveney.org'>richard@waveney.org</a>&gt;
  Date: 22 Sep 2000
  Last Modified: 1 Oct 2000
  Mailing List: <a href='mailto:perl6-language-regex@perl.org'>perl6-language-regex@perl.org</a>
  Number: 274
  Version: 2
  Status: Frozen</pre>
<a name='ABSTRACT'></a><h1>ABSTRACT</h1>
<p>This proposes a way for generalised additions to regex capabilities.</p>
<a name='DESCIPTION'></a><h1>DESCRIPTION</h1>
<p>Given that expansion of regexes could include (+...) and (*...), I have
been thinking about providing a general-purpose way of adding
functionality.  Hence I propose that the entire (+...) syntax be
kept free from formal specification for this purpose. (+ = addition)</p>
<p>A module or anything that wants to support some enhanced syntax
registers a callback that handles &quot;regex enhancements&quot;.</p>
<a name='Basic operation'></a><h2>Basic operation</h2>
<p>There are two ways these could operate:</p>
<a name='My original:'></a><h3>My original:</h3>
<p>My thoughts were to leave the syntax completely open so that anything
could be added - words, Chinese, $$$.</p>
<p>At regex compile time, if and when (+foo) is found, perl calls
each of the registered regex enhancements in turn. These:</p>
<p>1) Are passed the foo string as a parameter exactly as is.  (There is
an issue of actually finding the end of the generic foo.)</p>
<p>2) The regex enhancement can either recognise the content or not.</p>
<p>Does a newer localised definition replace the older one?  The handling of
multiple registrations has to be resolved. My initial thoughts are that a
&quot;Last registered is checked first&quot; approach may be best.</p>
<p>3) If it does not, the enhancement returns undef and perl goes on to the
next regex enhancement. (Does it handle the enhancements as a stack (last
checked first) or a list (first checked first)? How are they scoped? A job
here for the OO/scoping fanatics.)</p>
<p>4) If perl runs out of registered regex enhancements it reports an error.</p>
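<p>As a rough illustration of steps 1 to 4 with the &quot;Last registered is
checked first&quot; ordering, the chain could be modelled as below. This is a
sketch in Python rather than Perl, and register_enhancement,
resolve_enhancement and the example handlers are invented names for
illustration, not part of this proposal or of any existing API.</p>

```python
# Hypothetical model of the "last registered is checked first" chain.
# None of these names come from the RFC or from Perl itself.

_handlers = []  # the most recently registered handler sits at the end


def register_enhancement(handler):
    """Register a callback that takes the raw (+...) body and returns a
    replacement regex string, or None if it does not recognise the body."""
    _handlers.append(handler)


def resolve_enhancement(body):
    """Offer the body to each handler in turn, newest first."""
    for handler in reversed(_handlers):
        replacement = handler(body)
        if replacement is not None:
            return replacement
    # Ran out of registered enhancements: report an error (step 4).
    raise ValueError(f"unrecognised regex enhancement: (+{body})")


def hex_handler(body):
    return r"[0-9A-Fa-f]+" if body == "hex" else None


def word_handler(body):
    return r"\w+" if body == "word" else None


def newer_hex_handler(body):
    return r"0x[0-9A-Fa-f]+" if body == "hex" else None


register_enhancement(hex_handler)
register_enhancement(word_handler)
register_enhancement(newer_hex_handler)
```

<p>Here the later registration for hex shadows the earlier one, which is one
possible answer to the question of whether a newer definition replaces an
older one. Scoping is not modelled at all.</p>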
<a name='Hugo alternative:'></a><h3>Hugo's alternative:</h3>
<p>I'd be more inclined to have callbacks registered for a word: that
way we can complain earlier when two modules try to register the
same word. Then at regexp-compile time we parse out the word
following the (+ and immediately know who to pass it to (or fail).</p>
<p>Well, there are limits to what we can handle - earlier, the parser
will have had to be able to determine where the end of the regexp
is. Even specifying a word at the beginning doesn't help: we need
to know whether the rest should look like a regexp, or code, or
whatever else. The regexp compiler doesn't get a look in until
after that has been done.</p>
<p>Which suggests that maybe each callback - whether or not we link
them to words - should specify what it will match, which suggests
it should be linked with a regexp.</p>
<a name='Actions by the Callback'></a><h2>Actions by the Callback</h2>
<p>If an enhancement recognises the content it could do either of:</p>
<p>a) Return a replacement regex, expanded using existing capabilities; perl
will then pass this back through the regex compiler.</p>
<p>(+...) loops, where an expansion itself contains (+...), are allowed,
though the compiler might want to issue a warning if they appear to go too
deep.</p>
<p>b) Return a coderef that is called at run time when the regex gets to this
point, as with (?{...}) or (??{...}), or maybe (?*{...}); see RFC 198.</p>
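<p>Case (a), including the re-expansion of nested (+...) and a depth
warning, can be sketched roughly as follows. This is a Python model, not a
proposed implementation: the EXPANSIONS table stands in for registered
callbacks, the names digit, octet and ipv4 and the warn_depth parameter are
made up, and the body of (+...) is assumed to contain no parentheses.</p>

```python
import re
import warnings

# Hypothetical expansion table standing in for registered callbacks.
EXPANSIONS = {
    "digit": r"[0-9]",
    "octet": r"(+digit){1,3}",
    "ipv4": r"(+octet)(?:\.(+octet)){3}",
}

# Assumes the body of (+...) contains no parentheses, which sidesteps
# the question of finding the end of the construct.
ENHANCEMENT = re.compile(r"\(\+([^()]*)\)")


def expand(pattern, depth=0, warn_depth=10):
    """Replace every (+name) and feed each result back through expansion."""

    def replace(match):
        name = match.group(1)
        if name not in EXPANSIONS:
            raise ValueError(f"unrecognised regex enhancement: (+{name})")
        if depth == warn_depth:
            # Warn once per branch when nesting reaches the threshold.
            warnings.warn(f"(+...) expansion nested {depth} levels deep")
        return expand(EXPANSIONS[name], depth + 1, warn_depth)

    return ENHANCEMENT.sub(replace, pattern)
```

<p>Calling expand("^(+ipv4)$") yields a plain regex with no (+...) left in
it. A definition that refers to itself would warn and then keep recursing
until Python's own recursion limit stops it, so a real compiler would
presumably want a hard limit as well.</p>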
<a name='Embeded Code'></a><h2>Embedded Code</h2>
<p>The referenced code needs enough access to the regex internals
to be able to see the current sub-expression, request more characters, and
access the relevant flags and the visibility of greediness.  It may also need
a coderef that is similarly called when the regex is being unwound as it
backtracks. These features would also be of interest to the existing code
inside regexes.</p>
<p>Thinking from that - the last case should be generalised (it is sort of
like my (?*{...}) from RFC 198, or an enhancement to (??{...})).  If so, cases
(a) and (b) are the same, as case (b) is just a case of returning (?*{...})
around the appropriate code.</p>
<p>Access to subexpressions - OK, this can be done.</p>
<p>Visibility of flags - not currently possible. The code might like to know
that /i is in effect, and it might want to know that /s is in effect; it
probably does not need to know about /o.  This is equally true of the
enhancement regex handler that looks at the (+foo) in the first place.  I
think that these could be of use to (?{...}) code.</p>
<p>Greediness - maybe not necessary, but I think better visibility of
internals might be beneficial.</p>
<p>Hugo: Hm, I do appreciate the problem - I wasn't too happy when I realised
that embedded qr{} expressions are protected from the flags of their outer
regexp, cos I wanted to specify /i on the outside and have it trickle in to
the rest. It feels like it's going to get real messy, though, and totally
screw the optimiser.</p>
<p>Me: This also needs thinking about, but I can't resolve this at the
moment.</p>
<a name='Code execution for backtracking'></a><h2>Code execution for backtracking</h2>
<p>Following on, if (?{...}) etc. code is evaluated in a forward match, it
would be a good idea to likewise support some code block that is ignored on a
forward match but is executed when the code is unwound due to backtracking.
Thus (?{ foo })(?\{ bar }) executes foo in the forward case and bar if it
unwinds. I don't care at the moment what the syntax is - what about the
concepts? Think about foo putting something on a stack (e.g. the bracket to
match [RFC 145]) and bar taking it off, for example.</p>
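<p>To make the concept concrete, here is a toy backtracking matcher in
Python with a forward hook and an unwind hook per step. It is only a model
of the idea: the match function, the step tuples and the push/pop hooks are
invented for the example and say nothing about how the Perl regex engine is
or would be structured.</p>

```python
# Toy backtracking matcher; not how the Perl regex engine is structured.
def match(steps, text, pos=0, index=0):
    """Each step is (alternatives, forward, undo). `forward` runs when an
    alternative is taken; `undo` runs if the rest then fails to match."""
    if index == len(steps):
        return pos == len(text)
    alternatives, forward, undo = steps[index]
    for alt in alternatives:
        if text.startswith(alt, pos):
            forward(alt)
            if match(steps, text, pos + len(alt), index + 1):
                return True
            undo(alt)  # unwinding: reverse the side effect of forward
    return False


stack = []
log = []


def push(alt):
    stack.append(alt)
    log.append(f"push {alt}")


def pop(alt):
    stack.pop()
    log.append(f"pop {alt}")


def noop(alt):
    pass


# First step tries "a" then "ab"; the second step requires "c".
steps = [(["a", "ab"], push, pop), (["c"], noop, noop)]
matched = match(steps, "abc")
```

<p>Matching "abc" first takes "a", fails to find "c", and unwinds, so the
pushed value is popped again before "ab" is tried. Only the value pushed on
the successful path is left on the stack.</p>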
<p>These might be achieved by complex localisation.  Is localisation enough?
Enough to achieve everything you might want to? Yes: you can always
have a (?{ local $a = new Object }) with a DESTROY method. It may not
necessarily be the cleanest possible way to write everything, though.</p>
<p>So functionality for making this easier might be a good idea.</p>
<a name='CHANGES'></a><h1>CHANGES</h1>
<p>V2 - Several additions for clarity, and discussion of how the content is
recognised - mainly between myself and Hugo.  Everything here has come
from the discussion on the list.</p>
<a name='MIGRATION'></a><h1>MIGRATION</h1>
<p>This is a new feature - no compatibility problems.</p>
<a name='IMPLENTATION'></a><h1>IMPLEMENTATION</h1>
<p>This has not been looked at in detail, but the description above provides
some views as to how it may operate.</p>
<a name='REFERENCES'></a><h1>REFERENCES</h1>
<p>RFC 145: Brace-matching for Perl Regular Expressions</p>
<p>RFC 198: Boolean Regexes</p>
<p>This message from Larry:</p>
<p><a href='http://www.mail-archive.com/perl6-language@perl.org/msg02955.html' target='_blank'>www.mail-archive.com</a></p>
</div>
